ABSTRACT 

Summary: The F2CS server provides access to the software, 
F2CS2.00, that implements an automated method for predicting 
the SCOP and CATH classifications of proteins from their 
FSSP Z-scores (Getz et al., 2002). 
Availability: Free, at 
http://www.weizmann.ac.il/physics/complex/compphys/f2cs/ 
Contact: eytan.domany@weizmann.ac.il 
Supplementary information: The site contains links 
to additional figures and tables. 

Since during evolution protein structures are much 
more conserved than sequences and even functions 
[Holm & Sander, 1996], proteins are usually classified 
first by their structural similarity. Newly solved struc- 
tures of proteins are regularly stored in the Protein 
Data Bank (PDB) [Bernstein et al, 1977]. Many re- 
search groups study the diversity of protein structures 
and maintain web-accessible hierarchical classifications 
of them. Three widely used databases are FSSP 
[Holm & Sander, 1997], CATH [Orengo et al, 1997] and 
SCOP [Conte et al, 2000]; although each has its own way 
of comparing and classifying proteins, the resulting 
classification schemes are largely consistent with each other 
[Getz et al, 2002, Getz, 1998, Hadley & Jones, 1999]. 

The major difference between these three classification 
schemes, relevant to this work, is their degree of au- 
tomation. FSSP is based on a fully automated struc- 
ture comparison algorithm, DALI [Holm & Sander, 1994, 
Dietmann et al, 2001], that calculates a structural sim- 
ilarity measure (represented in terms of Z-scores) be- 
tween pairs of structures of protein chains taken from the 
PDB. FSSP first selects a subset of representative structures 
from the PDB and then applies the DALI algorithm to calculate 
the Z-scores for all pairs of representatives. Next, it 
calculates the Z-scores between each representative and the 
PDB structures it represents. Being fully automated, FSSP can 
be updated fairly often. FSSP was recently extended by a new 
database, called Dali [Holm, 2003], which contains 
all-against-all Z-scores between chains and domains of a 
larger representative set, PDB90 [Hubbard et al, 1999], in 
which no two chains are more than 90% identical in sequence. 
In contrast, CATH and 
SCOP use manual classification at certain levels of their 
hierarchy, which slows down the classification process and 
makes it more subjective and error-prone. 

CATH arranges protein domains in a four-level hierarchy 
according to their Class (secondary structure composition), 
Architecture (shape formed by the secondary structures), 
Topology (connectivity order of the secondary structures) 
and Homologous superfamily (structural and functional 
similarity). Classification of Architecture is done by 
visual inspection; hence CATH is partially manual. 

The top level (Class) of the SCOP database also de- 
scribes the secondary structure content of a protein do- 
main. The next level (Fold) groups together struc- 
turally similar domains. The lower two levels (super- 
family and family) describe near and distant evolution- 
ary relationships [Levitt & Chothia, 1976]. "Fold" largely 
corresponds to CATH's topology level [Getz et al, 2002]. 
SCOP is constructed manually, based on visual examina- 
tion and comparison of structures, sequences and func- 
tions. 
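To make the two hierarchies concrete, the following minimal Python sketch splits a dotted classification code into named levels. It assumes the numeric dotted notation used in Table 1 (1.20.5 for CATH Class, Architecture and Topology; 8.1 for SCOP Class and Fold). The names CATH_LEVELS, SCOP_LEVELS and parse_code are illustrative only and are not part of F2CS.

```python
# Level names follow the descriptions in the text; the dotted-code
# notation (e.g. "1.20.5") is an assumption taken from Table 1.
CATH_LEVELS = ("class", "architecture", "topology", "homologous_superfamily")
SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def parse_code(code, levels):
    """Map a dotted code onto the leading levels of a hierarchy.

    A code may be shorter than the hierarchy (a partial classification),
    but never longer, and every component must be a non-negative integer.
    """
    parts = code.split(".")
    if len(parts) > len(levels) or not all(p.isdigit() for p in parts):
        raise ValueError(f"malformed classification code: {code!r}")
    return {level: int(part) for level, part in zip(levels, parts)}


print(parse_code("1.20.5", CATH_LEVELS))
print(parse_code("8.1", SCOP_LEVELS))
```

A partial code simply fills the leading levels, which matches the server's scope: predictions stop at Topology for CATH and at Fold for SCOP.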

We present here a web-based server, available at 
http://www.weizmann.ac.il/physics/complex/compphys/f2cs/, 
whose aim is to predict, without human intervention, a 
protein's full SCOP and CATH classifications from its FSSP 
(or DALI) Z-scores. This can help classify proteins of known 
structure that have not yet been processed by SCOP or CATH 
(whose new releases are provided about every 6 months), and 
call attention to as yet unseen structural classes. 

If a protein appears in FSSP, the server returns our pre- 
diction. If it is not in FSSP, the user can submit the new 
structure to the DALI server, insert the resulting Z-scores 
into our server and obtain its predicted classification. In 
both cases F2CS outputs a table showing the prediction, 
along with its confidence level. 
Table 1: Results obtained by submitting "1dowb" to the F2CS server. This protein was classified by neither CATH v2.5 nor 
SCOP 1.63 (indicated by -1 in the "number of domains" columns). We predict the following classifications: 1.20.5 for CATH 
and 8.1 for SCOP, both at 100% confidence level. 



THE SERVER 

The current predictions are based on the latest versions 
of the databases: FSSP (Jun 16, 2002 update), combined 
with the Dali database (preliminary version, May 2003); 
CATH version 2.5 (Jul 2003); and SCOP release 1.63 (May 
2003). The FSSP database contains 27182 chains, of which 
2860 are representatives. We superimposed on these the 
Z-scores from the Dali database, which were calculated for 
6433 PDB90 chains; we refer to the combined database as 
FSSP/DD. Only significant Z-scores (Z > 2) are reported and 
used; all other Z-scores are assumed to be zero. 
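The thresholding rule above can be sketched as a sparse neighbor map in Python. This is an illustration under stated assumptions, not the F2CS code: the input is assumed to be a list of (chain, chain, Z) triples, the scores are treated as symmetric (the larger value is kept if a pair is listed twice), and the chain identifiers in the example are hypothetical.

```python
from collections import defaultdict

Z_THRESHOLD = 2.0  # significance cutoff stated in the text


def build_z_graph(pair_scores, threshold=Z_THRESHOLD):
    """Return a sparse map chain -> {neighbor: Z}, keeping only Z > threshold.

    Assumption: scores are stored symmetrically; if a pair occurs more
    than once, the larger Z-score is kept.
    """
    graph = defaultdict(dict)
    for a, b, z in pair_scores:
        if a != b and z > threshold:
            best = max(z, graph[a].get(b, 0.0))
            graph[a][b] = best
            graph[b][a] = best
    return dict(graph)


def z_score(graph, a, b):
    """Z-score of a pair; pairs that were not stored count as zero."""
    return graph.get(a, {}).get(b, 0.0)


# Hypothetical chains and scores, for illustration only.
pairs = [("c1", "c2", 12.4), ("c1", "c3", 1.7), ("c3", "c2", 3.1)]
graph = build_z_graph(pairs)
print(z_score(graph, "c1", "c2"), z_score(graph, "c1", "c3"))
```

Storing only the significant pairs keeps the structure small, and the lookup function makes the "all other Z-scores are zero" convention explicit.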

The server implements our method [Getz et al., 2002], 
Classification by Optimization (CO), an optimization 
procedure that searches for the class assignment of the 
proteins not yet processed by CATH or SCOP that attains a 
minimal cost. The cost of an assignment is the sum of the 
Z-scores between all pairs of proteins that were not 
assigned to the same class. This is a "partially supervised" 
algorithm, since its prediction uses both the labels of the 
proteins with known classification and the Z-scores among 
the training and predicted sets. We cannot classify 
"isolated" proteins, which are not connected by a path of 
neighboring chains (i.e. Z > 2) to a chain of known 
classification. 
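The cost function and the notion of an "isolated" protein can be illustrated with a toy Python sketch. It is not the optimization procedure of Getz et al. (2002): the minimization here is a brute-force enumeration that is only feasible for a handful of chains, and the edge dictionary, function names and chain names are all hypothetical.

```python
from collections import deque
from itertools import product


def cost(assignment, edges):
    """Sum of Z-scores over pairs of assigned chains placed in different classes."""
    total = 0.0
    for (a, b), z in edges.items():
        if a in assignment and b in assignment and assignment[a] != assignment[b]:
            total += z
    return total


def classifiable(edges, known):
    """Unlabeled chains linked to a labeled chain by a path of stored edges."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = set(known)
    queue = deque(known)
    while queue:  # breadth-first search outward from the labeled chains
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - set(known)


def brute_force_co(edges, known, unknown):
    """Try every labeling of `unknown` with the known classes; keep the cheapest."""
    classes = sorted(set(known.values()))
    best, best_cost = None, float("inf")
    for labels in product(classes, repeat=len(unknown)):
        trial = dict(known)
        trial.update(zip(unknown, labels))
        trial_cost = cost(trial, edges)
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return {u: best[u] for u in unknown}, best_cost


# Edges hold only significant scores (Z > 2); u3 and u4 are isolated.
edges = {("k1", "u1"): 10.0, ("u1", "u2"): 6.0, ("k2", "u2"): 3.0, ("u3", "u4"): 5.0}
known = {"k1": "A", "k2": "B"}
targets = sorted(classifiable(edges, known))
print(targets, brute_force_co(edges, known, targets))
```

In this example u1 and u2 both receive class A at a cost of 3.0, because separating u2 from k2 is cheaper than cutting either of the stronger links; u3 and u4 receive no prediction since no path connects them to a labeled chain.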

We generate a prediction database of chains which appear 
in FSSP/DD but not in SCOP or CATH by applying our algorithm 
for each classification scheme. The FSSP/DD version we are 
using contains 4014 chains which do not appear in CATH v2.5 
(we supply a prediction for 3170 of these) and 511 which are 
not in SCOP 1.63 (for 403 of these we have a prediction); 
272 chains appear in neither CATH nor SCOP. Since CATH and 
SCOP handle protein domains whereas FSSP/DD entries are 
protein chains (consisting of one or more domains¹), we use 
as a training set the single-domain chains that are of known 
classification. Note that SCOP and CATH do not always agree 
on their separation of proteins into domains. 

Our prediction's success rate was estimated using a 
blind test in which we hid the assignments of 3605 proteins 
from CATH and 4570 proteins from SCOP and tested our 
predictions against the known classifications. The success 
rate was estimated for each class separately (see website 
for details). Owing to the larger number of training 
examples and more stringent criteria for attempted 
classification, the success rate has improved over our 
previous work. 
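A per-class success rate of the kind described above could be tallied as in the following Python sketch. The text does not specify how the rates are grouped, so this sketch assumes grouping by predicted class (so that each rate can be read as the confidence of a prediction of that class) and assumes that chains with no attempted prediction are excluded. The function name and the data are hypothetical.

```python
from collections import Counter


def success_rate_by_class(true_labels, predicted):
    """Fraction of correct predictions, grouped by predicted class.

    Chains absent from `predicted` (no classification attempted)
    are skipped and do not affect any rate.
    """
    attempted = Counter()
    correct = Counter()
    for chain, truth in true_labels.items():
        guess = predicted.get(chain)
        if guess is None:
            continue
        attempted[guess] += 1
        if guess == truth:
            correct[guess] += 1
    return {cls: correct[cls] / attempted[cls] for cls in attempted}


# Hidden (true) labels versus predictions in a toy blind test.
hidden = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
predictions = {"a": "X", "b": "Y", "c": "Y"}
print(success_rate_by_class(hidden, predictions))
```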

With every new release of the databases, F2CS can be 
updated; the newly released CATH/SCOP classifications 
are added to the training set, while predictions are made 
for proteins contained in a new FSSP/DD release which 
are not yet classified by CATH or SCOP. 

¹ We do not classify the few cases in which a single domain contains 
several different chains or a combination of their parts. 



USAGE 

To retrieve our prediction for CATH's class, architecture 
and topology or SCOP's class and fold of a protein, enter 
the protein chain's identifier in the search box and submit 
the query. If the protein appears in our database, a table 
will be returned containing both the known and the predicted 
SCOP and CATH classifications. For example, submission of 
the chain identifier "1dowb", which was classified by 
neither CATH v2.5 nor SCOP v1.63, returns Table 1. We 
predict CATH classification 1.20.5 and SCOP 8.1, both at a 
confidence level near 100%. The "Success%" link points to a 
table with the exact numbers by which the success rates 
were estimated. 

If the queried protein is not in our database, the user can 
obtain its predicted classification by following these two 
steps: (a) submit the protein's PDB file to the DALI server 
(the engine behind FSSP), which calculates its structural 
similarity to the FSSP representatives and returns a list of 
the representatives and Z-scores for which Z > 2; (b) paste 
DALI's reply in the appropriate query box in our server. 